Short description:

In this project we conducted a spatial analysis of the Kensington and Chelsea district of London (hereafter K&C). We used data from:

The source code for this analysis is available in the GitHub repository (link in the top right corner).


Context

We chose the Kensington and Chelsea district of London for our analysis because we consider it one of the most “mixed” districts in the city. It is the smallest borough in London and the second smallest district in England, and one of the most densely populated administrative regions in the United Kingdom. It includes affluent areas such as Notting Hill, Kensington, South Kensington, Chelsea, and Knightsbridge, and it contains many of the most expensive residential properties in the world, which may point to distinct social inequalities. At the 2011 census, the borough had a population of 158,649, of whom 71% were White, 10% Asian, 5% of multiple ethnic groups, 4% Black African and 3% Black Caribbean. A 2017 study by Trust for London and the New Policy Institute found that Kensington & Chelsea has the greatest income inequality of any London borough. Private rent for low earners was also found to be the least affordable in London. However, the borough’s poverty rate of 28% is roughly in line with the London-wide average.

All these factors convinced us that a more in-depth analysis of the district might yield insightful results.


Hypothesis

In the further analysis we decided to test two major hypotheses:


Loading the data

First, let us load the previously prepared data:

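library(sp)      # SpatialPointsDataFrame, CRS, proj4string
library(rgdal)   # readOGR
library(tidyr)   # gather
library(dplyr)   # group_by, summarize, inner_join, %>%

# Census variables per Output Area (OA) and individual house sales
Census.Data <- read.csv("census_data.csv")
houseData <- read.csv("house_data.csv")

# Columns 6:7 of houseData are used as the point coordinates (EPSG:27700)
House.Points <- SpatialPointsDataFrame(houseData[, 6:7], houseData,
                                       proj4string = CRS("+init=EPSG:27700"))

# Long format (one row per variable/value pair) for the histograms, without the first column
hist_df <- gather(Census.Data[, -1], key = "name", value = "value")

# 2011 Output Area boundaries for the whole of London
Output.Areas <- readOGR("data/statistical-gis-boundaries-london/ESRI", "OA_2011_London_gen_MHW")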
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\Asus\GIT\spatial-analysis\data\statistical-gis-boundaries-london\ESRI", layer: "OA_2011_London_gen_MHW"
## with 25053 features
## It has 17 fields
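# Keep only the Output Areas that belong to Kensington and Chelsea
Output.Areas <- Output.Areas[Output.Areas$LAD11NM == "Kensington and Chelsea", ]

# Attach the census variables to the OA polygons
OA.Census <- merge(Output.Areas, Census.Data, by.y = "OA", by.x = "OA11CD")
proj4string(OA.Census) <- CRS("+init=EPSG:27700")

# Mean sale price per OA
House.Agg <- houseData %>%
  group_by(oa11) %>%
  dplyr::summarize(mean_price = mean(price_paid, na.rm = TRUE))

# Census variables joined with mean prices; OAs without any sale are dropped
houses_merged <- Census.Data %>%
  inner_join(House.Agg, by = c("OA" = "oa11"))

# Polygons restricted to OAs that have both census data and a mean price
OA.Census.mp <- merge(Output.Areas, houses_merged, by.y = "OA", by.x = "OA11CD", all = FALSE)
proj4string(OA.Census.mp) <- CRS("+init=EPSG:27700")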

Exploratory Data Analysis

We use histograms to explore the density of each variable, to take a brief look at the dataset, and to identify potential model variables for the further analysis.

Now, let’s check the distribution of percentages across all Output Areas (OAs) for the religion-based variables, the race-based variables and a few other chosen variables.

We can see that the majority of inhabitants of K&C are Christian. However, there are many outliers for the Muslim religion, which suggests that some OAs in K&C may have a predominantly Muslim population.


Once again, there is one dominant group, white people, but there are a few outliers in every group, which again suggests that some OAs differ greatly from the rest.


Finally, looking at the distribution of the chosen variables, we can see that there are, for example, OAs with a very low employment rate or a rather high percentage of families in which only children under the age of 6 have English as their primary language. Considering also the previous two plots, we expect that K&C might be divided into wealthy areas, poor areas, areas mostly occupied by immigrants and areas defined by religion. That is a good starting point for further spatial analysis.

Now let us explore the relationships between some variables that we would expect to be correlated. We can see them on the scatterplots below:


Looking at the first one, we can see a negative correlation between the percentage of residents with the highest qualification level in an OA and the unemployment rate. Additionally, OAs with more Black/African residents and more residents of Muslim faith are mostly the ones with the highest unemployment rates. That indicates that there might be a problem with employment opportunities and education for immigrants.

On the second one, the correlation between the employment rate and the percentage of inhabitants with the highest qualification level is rather positive. What is interesting, however, is that the OAs with the most qualified residents and the highest employment rates are also the ones with the most white people from European countries other than the United Kingdom. That might raise the suspicion that there is a racial problem for immigrants: it is far easier to get education and employment if you are white. One could argue that many students from across Europe aspire to study or work in London, which might also create this bias. However, this explanation is not supported by the gathered data.

Correlation plot:


The variables that we expected to be correlated with each other matched our expectations. Most of the non-white, non-native OAs are also the ones with the most socially rented flats, the highest unemployment rate and the lowest qualifications.


Spatial visualisations

Now we will explore how the percentages of citizens with the highest qualification level, and of Black/African or white residents, are distributed among OAs. All plots are interactive, so we can also analyse them with reference to geographical location.


As we expected, the OAs with the highest densities of Black/African residents also have many residents of Muslim faith. In contrast, areas with mostly white residents are also the ones with the highest rate of highly qualified citizens. We can also see that OAs in the north of the district (mostly North Kensington) are visibly distinct from the centre and south (Chelsea), which are mostly occupied by natives.

However, what is truly remarkable is that the OAs with the most people born in the UK are also the ones with the most Black/African residents. So where do the highly qualified white people residing in the south come from?

Let us explore how house prices in K&C compare with the distribution of citizens who were born in EU countries:


So, there is the answer: those mostly white OAs with a high rate of qualification are the areas occupied mostly by EU-born citizens. Prices in these areas are also much higher. We added the distribution of unemployment, both to show that it overlaps with the northern areas discussed earlier and that it is negatively correlated with prices. We expect that there might not be a direct cause and effect; it is probably more of a vicious cycle.


From the density plots we can see that the “richest” OAs are also the areas with the most houses sold, i.e. the highest turnover of houses.

Model part

In this part we focus on what drives the employment rate and the housing prices in the Royal Borough of Kensington and Chelsea.

First, we need to compute the neighbours of each polygon (OA) in our dataset, which we are going to use in later parts of the study. Doing this in two different ways results in:


The blue and red lines correspond to different methods: the red ones are created using Rook’s case neighbours and the blue ones are simply links to neighbouring OAs. Although the methods are different, the differences can hardly be seen (both visually and numerically).
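To make the difference between contiguity definitions concrete, here is a minimal base-R sketch on a hypothetical 3 x 3 grid of square cells (not the K&C polygons). It assumes, purely for illustration, that the second method also counts cells touching only at a corner (queen's case); the report does not name that method.

```r
# Toy 3 x 3 grid of square cells, numbered row by row (hypothetical data)
cells <- expand.grid(col = 1:3, row = 1:3)

# Absolute column and row offsets between every pair of cells
d_col <- abs(outer(cells$col, cells$col, "-"))
d_row <- abs(outer(cells$row, cells$row, "-"))

# Rook's case: cells share an edge (offset of 1 in exactly one direction)
rook <- (d_col + d_row) == 1
# Queen's case (assumed second method): cells share an edge or a corner
queen <- pmax(d_col, d_row) == 1

# Number of neighbours per cell under each definition
rook_count <- rowSums(rook)
queen_count <- rowSums(queen)
print(rbind(rook = rook_count, queen = queen_count))
```

On a regular grid the two definitions differ a lot because every interior corner is shared by four cells. Irregular polygons such as OAs rarely meet at a single point, which would be consistent with the two neighbour sets being nearly identical.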


Global Spatial Autocorrelation

Having computed the neighbours, we can run Moran’s test, which yields a spatial autocorrelation score for our employment rate variable.

Moran I Test:

## 
##  Moran I test under randomisation
## 
## data:  OA.Census$employed  
## weights: listw    
## 
## Moran I statistic standard deviate = 14.574, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3543126571     -0.0015873016      0.0005963801


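The employment indicator has a Moran’s I statistic of 0.35, well above the expectation of about -0.002 under no autocorrelation (p-value < 2.2e-16), so it shows positive spatial autocorrelation. We therefore have reason to believe that the data cluster spatially.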
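For readers who want to see what the statistic measures, the following base-R sketch computes Moran's I by hand on a hypothetical 4 x 4 grid with row-standardised rook weights. The grid and values are assumptions for illustration only; the test above was run on the real OA neighbours.

```r
# Toy example: 4 x 4 grid with a west-to-east gradient (hypothetical values)
cells <- expand.grid(col = 1:4, row = 1:4)
x <- cells$col * 10  # value depends only on the column, so neighbours are similar

# Binary rook contiguity matrix, then row-standardised weights
adj <- (abs(outer(cells$col, cells$col, "-")) +
        abs(outer(cells$row, cells$row, "-"))) == 1
W <- adj / rowSums(adj)

# Moran's I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2), with z = x - mean(x)
n <- length(x)
z <- x - mean(x)
S0 <- sum(W)
moran_I <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)

# Expected value under the null hypothesis of no spatial autocorrelation
expected_I <- -1 / (n - 1)

print(c(I = moran_I, expectation = expected_I))
```

The expectation depends only on the number of areas: the value -0.0015873016 in the output above equals -1/630, which matches a dataset of 631 OAs.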


Local Spatial Autocorrelation

Analysing local spatial autocorrelation may lead to broader conclusions.

Below, you can see the local Moran statistic on our map, broken down by OA:


We can clearly see that there are indeed areas surrounded by units with similar values: those with a positive local Moran statistic. This further confirms that the data cluster spatially. However, this map does not give us specific information about those areas or clusters.

To gain insight into each one, we are going to use a LISA cluster map.


LISA cluster map

We set the significance level to 0.2 to obtain a more readable visualisation.


We can now observe which clusters have high and which have low values. The northern areas of Kensington mostly have low employment rates but border areas with high values (hence the light blue colour), while some areas of Chelsea with high employment rates border areas with low values (hence the light red), even though they lie in the centre of Chelsea. Additionally, we can see a few high-high and low-low clusters (strong red/blue colour): these are areas bordering others that are similar in terms of employment. We can conclude that the data are significantly clustered in terms of employment.
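The four cluster types can be illustrated with a small base-R sketch that computes the local Moran statistic and assigns each cell to a quadrant. It uses a hypothetical 4 x 4 grid and omits the significance test that a LISA cluster map applies before colouring an area.

```r
# Toy 4 x 4 grid: high values in the west, low in the east (hypothetical),
# with one low-value cell placed among the high ones
cells <- expand.grid(col = 1:4, row = 1:4)
x <- ifelse(cells$col <= 2, 80, 40)
x[cells$col == 1 & cells$row == 2] <- 40

# Row-standardised rook contiguity weights
adj <- (abs(outer(cells$col, cells$col, "-")) +
        abs(outer(cells$row, cells$row, "-"))) == 1
W <- adj / rowSums(adj)

# Deviation from the mean and its spatial lag (mean deviation of the neighbours)
z <- x - mean(x)
lag_z <- as.vector(W %*% z)

# Local Moran statistic: positive when a cell resembles its neighbours
local_I <- z * lag_z / (sum(z^2) / length(z))

# Quadrant: own value (high/low) followed by the neighbours' value (high/low)
quadrant <- ifelse(z > 0 & lag_z > 0, "high-high",
            ifelse(z < 0 & lag_z < 0, "low-low",
            ifelse(z > 0 & lag_z < 0, "high-low",
            ifelse(z < 0 & lag_z > 0, "low-high", "undefined"))))
print(table(quadrant))
```

The isolated low cell is classified as low-high and has a negative local statistic, which is the situation described above for areas with low employment bordering areas with high values.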


Getis-Ord

We can also check whether the Getis-Ord Gi statistic can help us with our analysis. It lets us extend the analysis to proximity-based neighbours instead of border-based ones.

We set the proximity threshold to 750 metres (which brings out the clusters most clearly) and map the hot spots by cluster intensity.
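As a rough illustration of proximity-based neighbours, the sketch below builds distance-band weights with a 750-metre threshold and computes a hot-spot z-score in base R. The point coordinates and values are hypothetical, and the sketch assumes the Gi* variant, in which each point counts as its own neighbour; the report does not say which variant was used.

```r
# Hypothetical point locations (metres) and values; not the K&C data
pts <- data.frame(
  x = c(0, 300, 600, 3000, 3300, 3600),
  y = c(0, 200, 0, 3000, 3200, 3000),
  value = c(80, 85, 90, 40, 35, 45)
)

# Distance-band weights: 1 if two points lie within the threshold, else 0.
# The diagonal is 1, so each point counts itself (assumed Gi* variant).
threshold <- 750
d <- as.matrix(dist(pts[, c("x", "y")]))
W <- (d <= threshold) * 1

n <- nrow(pts)
x_bar <- mean(pts$value)
s <- sqrt(sum(pts$value^2) / n - x_bar^2)

# Gi* z-score: local weighted sum compared with its expectation under randomness
sum_w <- rowSums(W)
sum_w2 <- rowSums(W^2)
gi_star <- as.numeric(
  (as.vector(W %*% pts$value) - x_bar * sum_w) /
    (s * sqrt((n * sum_w2 - sum_w^2) / (n - 1)))
)

print(round(gi_star, 2))
```

Positive scores mark hot spots (high values surrounded by high values within the distance band) and negative scores mark cold spots.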

We can clearly see that there are three main clusters:

Let’s see if we can infer more from regression models and explain the employment rate.


Linear model - employment rate

We will start with the OLS model.

Using backward variable selection we eliminated most of the insignificant variables and arrived at the following model:

## 
## Call:
## lm(formula = OA.Census$employed ~ . - 1, data = OA.Census[, sig_cols_2])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.3137  -3.2513   0.3911   3.7125  19.6324 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## white          0.31933    0.02140  14.923  < 2e-16 ***
## black_african  0.28316    0.05154   5.494 5.71e-08 ***
## single         0.12605    0.01913   6.588 9.44e-11 ***
## lowest_quali   0.58775    0.10107   5.815 9.65e-09 ***
## highest_quali  0.54895    0.02709  20.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.733 on 626 degrees of freedom
## Multiple R-squared:  0.992,  Adjusted R-squared:  0.9919 
## F-statistic: 1.548e+04 on 5 and 626 DF,  p-value: < 2.2e-16


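We obtain an almost perfectly fitted model, with R-squared = 0.99. This figure should be read with caution, though: the model is fitted without an intercept (the "- 1" in the formula), and in that case R computes R-squared against zero rather than against the mean of the response, which inflates it and makes it not comparable with the R-squared of models that include an intercept. Turning to the coefficients, all of them are positive; the qualification levels are the strongest and the indicator that people live alone is the weakest. We can also observe that the shares of white and Black/African residents, surprisingly given the earlier data exploration, affect the employment indicator in almost the same way.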
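How much the missing intercept can change the reported R-squared is easy to check on synthetic data. The sketch below uses made-up values (not the census data) in which the predictor explains very little, yet the no-intercept fit reports a very high R-squared.

```r
# Hypothetical data: a response far from zero, only weakly related to the predictor
set.seed(1)
x <- runif(200, 40, 60)
y <- 60 + 0.1 * x + rnorm(200, sd = 5)

fit_origin <- lm(y ~ x - 1)   # no intercept, as in the model above
fit_icpt <- lm(y ~ x)         # with intercept

# R-squared as reported by summary() for each model
r2_origin <- summary(fit_origin)$r.squared
r2_icpt <- summary(fit_icpt)$r.squared

# Centred R-squared for the no-intercept fit, for a like-for-like comparison
rss <- sum(residuals(fit_origin)^2)
r2_centred <- 1 - rss / sum((y - mean(y))^2)

print(c(reported = r2_origin, centred = r2_centred, with_intercept = r2_icpt))
```

When comparing models with and without an intercept, the residual sum of squares or the centred R-squared is a safer yardstick than the value printed by summary().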

Let’s see if the residuals vary among different OAs.

There are no clear patterns in the residuals, which suggests that we did not omit an important spatially structured variable. However, a spatial approach using Geographically Weighted Regression (GWR) may bring more information.


GWR - employment rate

We start by calculating the kernel bandwidth for the GWR computation.

## Call:
## gwr(formula = OA.Census$employed ~ . - 1, data = OA.Census[, 
##     sig_cols_2], adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.1901549 (about 119 of 631 data points)
## Summary of GWR coefficient estimates at data points:
##                   Min.  1st Qu.   Median  3rd Qu.     Max. Global
## white         0.270580 0.288871 0.299836 0.313213 0.335517 0.3193
## black_african 0.023445 0.195534 0.242522 0.281999 0.351382 0.2832
## single        0.027817 0.052734 0.085185 0.213740 0.313398 0.1260
## lowest_quali  0.207876 0.276701 0.792704 0.904568 1.053125 0.5878
## highest_quali 0.462734 0.496910 0.596113 0.614051 0.651189 0.5490
## Number of data points: 631 
## Effective number of parameters (residual: 2traceS - traceS'S): 22.44898 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 608.551 
## Sigma (residual: 2traceS - traceS'S): 5.564811 
## Effective number of parameters (model: traceS): 16.77853 
## Effective degrees of freedom (model: traceS): 614.2215 
## Sigma (model: traceS): 5.539064 
## Sigma (ML): 5.464925 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3970.666 
## AIC (GWR p. 96, eq. 4.22): 3950.797 
## Residual sum of squares: 18845.07 
## Quasi-global R2: 0.6938792

The quasi-global R-squared (0.69) seems much lower, but it is not directly comparable with the OLS value of 0.99, which comes from a model without an intercept. In terms of residual sum of squares, GWR (18,845) is slightly better than OLS (about 20,575, i.e. 5.733² × 626). Let’s see how the fit varies across the areas.

It seems that the model fits best in the north, where the local R-squared is highest.

As in the linear model, all coefficients are positive. We can further explore how they vary among OAs.

The race-based coefficients turned out to have the greatest impact in the north-central areas. As for qualifications, the south, mainly Chelsea, is affected in almost the same way by the lowest and the highest qualification levels.
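The idea behind locally varying coefficients can be sketched in base R as a series of weighted regressions, one per location, with weights that decay with distance. The example below uses synthetic data and a fixed Gaussian bandwidth chosen by hand; this is a simplification, since the model above uses an adaptive bandwidth selected automatically.

```r
# Hypothetical data: the effect of x on y grows from west to east
set.seed(42)
n <- 100
east <- runif(n, 0, 10)
north <- runif(n, 0, 10)
x <- rnorm(n)
y <- (0.5 + 0.2 * east) * x + rnorm(n, sd = 0.1)

# Fixed Gaussian kernel bandwidth (assumed for illustration)
bandwidth <- 2

# For each location, fit a weighted regression that favours nearby observations
local_slope <- sapply(seq_len(n), function(i) {
  d <- sqrt((east - east[i])^2 + (north - north[i])^2)
  w <- exp(-0.5 * (d / bandwidth)^2)
  coef(lm(y ~ x - 1, weights = w))[["x"]]
})

# Single global coefficient for comparison
global_slope <- coef(lm(y ~ x - 1))[["x"]]

print(summary(local_slope))
print(global_slope)
```

Mapping local_slope against location would show the same kind of picture as the coefficient maps: one global coefficient is replaced by a surface of local estimates.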

Let’s see how the models predict the mean house prices aggregated by OA.


Linear model - prices

We will follow the same methodology as with the employment rate.

Using backward variable selection we eliminated most of the insignificant variables and arrived at the following model:

## 
## Call:
## lm(formula = OA.Census.mp$mean_price ~ ., data = OA.Census.mp[sig_cols])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2221130  -437500  -119641   227375  6426924 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6183339     735000   8.413 3.64e-16 ***
## single          -12374       4913  -2.519  0.01207 *  
## muslim          -17097       7559  -2.262  0.02411 *  
## highest_quali    12094       5790   2.089  0.03721 *  
## jewish           70316      23129   3.040  0.00248 ** 
## asian            -7480       9048  -0.827  0.40879    
## one_car         -38962       9201  -4.234 2.70e-05 ***
## no_cars         -44399       7206  -6.161 1.42e-09 ***
## Age_30_44       -12502       9360  -1.336  0.18220    
## employed        -15271       7348  -2.078  0.03817 *  
## private_rent      5909       3734   1.582  0.11415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 860500 on 537 degrees of freedom
## Multiple R-squared:  0.3998, Adjusted R-squared:  0.3886 
## F-statistic: 35.76 on 10 and 537 DF,  p-value: < 2.2e-16

For this target variable the model is not as well fitted: its R-squared equals 0.40. It is interesting that the percentage of Jewish residents has the most positive effect on the average price, while the one-car and no-car indicators have the most negative. The intercept is estimated at over 6 million. Let’s check the residuals in each of the OAs.

There are no patterns in the residuals, which suggests that we did not omit any significant variable in our modelling process.

Maybe the GWR model will bring more value.


GWR - prices


## Call:
## gwr(formula = OA.Census.mp$mean_price ~ ., data = OA.Census.mp[, 
##     sig_cols], adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.9999285 (about 547 of 548 data points)
## Summary of GWR coefficient estimates at data points:
##                    Min.   1st Qu.    Median   3rd Qu.      Max.    Global
## X.Intercept.  6088655.0 6120270.2 6166817.8 6319731.2 6344257.2 6183339.3
## single         -12875.8  -12827.6  -12476.1  -12242.1  -12106.5  -12373.7
## muslim         -18901.7  -18208.7  -17912.6  -17555.4  -17225.3  -17096.8
## highest_quali   11117.2   11480.6   11588.8   11737.1   12170.0   12094.0
## jewish          65184.9   66420.8   67963.5   70085.6   71072.1   70316.2
## asian           -7583.5   -7500.8   -7284.7   -7204.3   -7159.6   -7479.5
## one_car        -40864.5  -40397.0  -38314.1  -37765.0  -37536.6  -38961.7
## no_cars        -45261.9  -44957.5  -44167.7  -43936.7  -43719.8  -44399.1
## Age_30_44      -14652.5  -13714.0  -13312.8  -12977.9  -12410.5  -12501.7
## employed       -15235.4  -14847.0  -14650.6  -14381.1  -13797.6  -15271.2
## private_rent     5505.2    5679.7    6169.0    6293.5    6343.6    5908.7
## Number of data points: 548 
## Effective number of parameters (residual: 2traceS - traceS'S): 13.47969 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 534.5203 
## Sigma (residual: 2traceS - traceS'S): 860638.4 
## Effective number of parameters (model: traceS): 12.30621 
## Effective degrees of freedom (model: traceS): 535.6938 
## Sigma (model: traceS): 859695.2 
## Sigma (ML): 849987.5 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 16546.15 
## AIC (GWR p. 96, eq. 4.22): 16531.13 
## Residual sum of squares: 3.959183e+14 
## Quasi-global R2: 0.402407

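This time, R-squared is marginally higher than in the linear model (0.402 versus 0.400). Note that the selected adaptive bandwidth covers almost all observations (about 547 of 548 data points), so the local regressions are very close to the global one and the coefficients vary little. The local R-squared can be observed on the plot below:

GWR clearly fits the north-west part better, and the quality of fit gradually decreases further east.

Let’s further explore the results by plotting the coefficients of each variable.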


Interpolation

We are now going to try a different technique, called interpolation. Because we lack data in some of the OAs, we will use this method to “fill in the gaps”. For the first method we need to create Thiessen polygons, which partition the area into regions similar to our OAs by assigning every location to its closest housing point. We then clip those polygons to the area of Kensington and Chelsea and finally plot them on the map, filling the areas with price levels.
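The rule behind Thiessen polygons, where every location takes the value of its nearest observation, can be sketched in base R on a regular grid. The sale points and prices below are hypothetical, and the sketch assigns values to grid cells instead of constructing the polygon geometry.

```r
# Hypothetical sale points (coordinates in metres) and prices
sales <- data.frame(x = c(100, 400, 800), y = c(100, 700, 300),
                    price = c(500000, 1200000, 900000))

# Regular grid of locations to fill in
grid <- expand.grid(x = seq(0, 900, by = 100), y = seq(0, 900, by = 100))

# Index of the nearest sale for every grid location (squared distance is enough)
nearest <- apply(grid, 1, function(p) {
  which.min((sales$x - p[["x"]])^2 + (sales$y - p[["y"]])^2)
})

# Each grid location inherits the price of its nearest sale
grid$price <- sales$price[nearest]

print(table(grid$price))
```

The resulting surface is piecewise constant: prices jump at the boundary between two neighbouring points, which is what the polygon map shows.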

Looking at the arrangement of the polygons, we can see that there are indeed groups of close neighbours in the south and centre-west, while the points in the north are much further apart. This, of course, reflects the 2011 data that we used.


Inverse Distance Weighting (IDW)

Next, we will use IDW. It converts the point data of house prices into numerical values spread over a continuous surface, where each location receives an average of the observed prices weighted by inverse distance. We use it mostly to make it easier to visualise how the data are distributed across space. This is also an interpolation method, just a different one from Thiessen polygons.
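A minimal base-R sketch of inverse distance weighting is shown below. The sale points are hypothetical, and the power parameter of 2 is an assumption chosen as a common default, not a value taken from the analysis.

```r
# Hypothetical sale points (coordinates in metres) and prices
sales <- data.frame(x = c(100, 400, 800), y = c(100, 700, 300),
                    price = c(500000, 1200000, 900000))

# IDW estimate at one location: average of observed prices weighted by 1 / distance^power
idw <- function(px, py, pts, power = 2) {
  d <- sqrt((pts$x - px)^2 + (pts$y - py)^2)
  # At an observed point, return the observed price (avoids division by zero)
  if (any(d == 0)) return(pts$price[which.min(d)])
  w <- 1 / d^power
  sum(w * pts$price) / sum(w)
}

# Interpolate over a regular grid
grid <- expand.grid(x = seq(0, 900, by = 100), y = seq(0, 900, by = 100))
grid$price <- mapply(idw, grid$x, grid$y, MoreArgs = list(pts = sales))

print(range(grid$price))
```

Unlike the nearest-point rule, the interpolated surface changes smoothly between observations and never leaves the range of the observed prices.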

3D Plotting


The 3D visualisation (especially the interactive version, which cannot be displayed in Markdown, although the code for it is provided) further supports our previous remarks.


Conclusion

There were some insights that we did not expect to come upon. For example, we concluded that the division might be one of wealth rather than race: although there are areas in which Black/African people make up a larger share of citizens, these are the same areas that have a majority of native British citizens. However, there are very distinct areas dominated by people from the EU with the highest qualifications, which results in the lowest unemployment rates. Moreover, immigrants, people of races other than white and people of religions other than Christian tend to cluster in their own OAs, completely separate from the regions populated mostly by white, rather rich and highly educated citizens (who are less concentrated, with no marked boundary).

Considering our hypotheses, we confirmed that demographic factors do indeed affect the employment rate in a given OA. However, the addition of spatially lagged variables did not improve the predictive ability of the models. Mean house prices in the OAs, on the other hand, depend on spatial factors as well as on variables from the same OA: the GWR model gave slightly improved results compared with simple linear regression (quasi-global R-squared of 0.402 versus 0.400).

We acknowledge the shortcomings of our work. For example, we only analysed the prices of houses that were actually sold. That is, however, the only data accessible in a well-structured format; we would have liked to analyse the prices of listed houses, but that would require a lot of scraping from sources we are not familiar with. We would also like to include other districts in our research, perhaps comparing districts less diverse than K&C to see whether there are major differences in, for example, house price patterns.


Thank you, hope you enjoyed it.